Abstract

Many real-world networks are large, complex and thus hard to understand, analyze or 
visualize. Data about networks are not always complete, their structure may be hidden, 
or they may change quickly over time. Therefore, understanding how an incomplete 
system differs from a complete one is crucial. In this paper, we study the changes in 
networks submitted to simplification processes (i.e., reduction in size). We simplify 30 
real-world networks using six simplification methods and analyze the similarity between 
the original and simplified networks based on the preservation of several properties, for 
example, degree distribution, clustering coefficient, betweenness centrality, density and 
degree mixing. We propose an approach for assessing the effectiveness of the simplification
process to define the most appropriate size of simplified networks and to determine
the method that preserves the most properties of original networks. The results reveal
that the type and size of original networks do not affect the changes in the networks
when submitted to simplification, whereas the size of simplified networks does. Moreover,
we investigate the performance of simplification methods when the size of simplified
networks is 10% that of the original networks. The findings show that sampling methods
outperform merging ones, particularly random node selection based on degree and
breadth-first sampling.

1. Introduction

Over the past decade, network analysis has proved to be a suitable tool for
describing diverse systems, understanding their structure and analyzing their properties.
However, the evolution of the Web and the capability of storing large amounts of data
have caused the size of networked systems and thus their complexity to increase. The
algorithms for analyzing and visualizing networks appear impractical for addressing very
large systems. Therefore, different methods have been proposed for the simplification of
complex networks.

Simplification is a process that reduces the size of a network by decreasing the number
of nodes and links. The procedure is derived from graph theory (e.g., partitioning [?]
and blockmodeling [?]) and was initially developed for compression and efficient graph
storage [?]. With the increasing complexity of networks, simplification methods also
support clearer visualization [?, ?] and efficient analysis [?, ?]. In addition to these
benefits, analyzing the changes undergone by networks under the effects of the simplification
process enables us to explore and explain the differences between complete (i.e.,
original) and incomplete (i.e., simplified) systems (e.g., when only partial insight into the
structure of a network is available).

Recently, network simplification has been extensively investigated from different perspectives.
Some studies have concentrated on the simplification of specific networks,
such as simplifying social networks based on stability and retention [11], sampling scale-free [12]
or directed networks [?], estimating different properties under social network
crawling [?], sampling large dynamic peer-to-peer networks with random walks or
simplifying flow networks by removing useless links [?]. Other studies have attempted
to provide a sufficient fit to original networks and thus observe the changes in network
properties under the effects of simplification, such as preserving the clustering coefficient [?],
degree distribution [?], community structure [?], spectral properties [?] or
network connectivity [21].


However, only a few studies have focused on comparing simplification methods and
measuring their success. Leskovec et al. [?] observed properties of original and simplified
networks submitted to several simplification methods and measured their success based
on random walk similarity. Lee et al. [?] analyzed basic network properties under the
effects of three simplification methods and revealed characteristic patterns of changes in
properties. Hübler et al. [?] compared their simplification algorithm to existing ones by
measuring the average distance of properties between original and simplified networks.
Toivonen et al. studied the compression of weighted networks and measured the
method's efficiency according to the running time and cost of the compressed network
representation. Doerr and Blenn [?] tested the convergence of different properties under
three traversal algorithms applied to a single large social network. The findings of the
aforementioned analyses indicate that the performances of simplification methods vary;
however, the common weakness of these studies is the small set of networks considered.

Despite the above-described efforts, several open questions remain concerning the
simplification of complex networks, such as those regarding (Q1) how to evaluate the
similarity between original and simplified networks, (Q2) how small simplified networks
should be and ultimately (Q3) what simplification method should be used. In this paper,
we address these questions and propose an approach for assessing the effectiveness of the
simplification process. We analyze 30 real-world networks of different size and origin
under the effects of six different simplification methods. We compare the original and
simplified networks based on several network properties (e.g., degree distribution, clustering
coefficient [?], betweenness centrality [?], degree mixing [?] and transitivity [27])
(Q1). The selection of these properties is supported by their common use in similar studies
[9, 10]. Moreover, we propose a measure for determining the most appropriate size
of simplified networks for preserving the observed properties (Q2) and for determining
under which method the simplified networks fit the original ones most closely (Q3). We
also study the impact of the original network size and type on the effectiveness of the
simplification process.

Figure 1: Sampling methods applied to a small sample network for s = 0.5. Black nodes represent
simplified networks obtained by (a) selecting nodes uniformly at random (RN), (b) selecting nodes
with probability proportional to degree (RD), (c) selecting links uniformly at random (RL) and (d)
performing breadth-first search starting at a randomly selected node (BF). BF
ensures a connected network, whereas the other methods do not always do so.




The rest of the paper is structured as follows. Section 2 focuses on the simplification
methods and real-world networks used in the study and describes the proposed measure.
In Section 3, we report and formally discuss the results of the analysis. Finally, Section 4
concludes the paper and suggests directions for future research.


2. Methods and data 


2.1. Simplification methods 

Several authors have proposed a broad collection of simplification methods, which
can be divided into two general classes. Those in the first class are sampling methods
in which a simplified network is represented by a random sample of the original network
(e.g., random node selection [?], random link selection [?], snowball sampling [?],
random walk sampling [?] and forest fire [?]). Methods in the second class obtain simplified
networks by merging nodes and links into supernodes and superlinks based on
different characteristics, such as the distance between nodes (e.g., cluster-growing and
box-tiling renormalization [?]), node and link attributes (e.g., link weights and node
attributes [?]) or community structure (e.g., balanced propagation and modularity
optimization [34]).

In this study, we adopt four basic sampling methods (Fig. 1). Random node [?]
(RN) and random link selection [?] (RL) create sampled networks with nodes or links
selected uniformly at random. Simplified networks under random node selection based
on degree [?] (RD) consist of randomly selected nodes, where the probability of selecting
a node is proportional to the node's degree. In breadth-first sampling (BF), a random
node is selected together with its broad neighborhood using the breadth-first
search strategy. The main advantages of these methods are simplicity, and thus efficient
implementation with low time complexity, and adjustability, which enables setting the
size of the simplified network in advance.
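
The four sampling methods can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the study: the network is assumed to be an undirected adjacency dictionary with comparable node labels, all function names are hypothetical, and the way RL reaches its target size (adding random links until enough nodes are covered) is an assumption.

```python
import random
from collections import deque

def target_size(adj, s):
    """Number of nodes to keep for a simplified network of relative size s."""
    return max(1, round(s * len(adj)))

def sample_rn(adj, s, rng):
    """RN: nodes selected uniformly at random."""
    return set(rng.sample(sorted(adj), target_size(adj, s)))

def sample_rd(adj, s, rng):
    """RD: nodes drawn without replacement, probability proportional to degree."""
    k = target_size(adj, s)
    chosen = set()
    while len(chosen) < k:
        rest = [v for v in sorted(adj) if v not in chosen]
        weights = [len(adj[v]) for v in rest]
        if sum(weights) == 0:  # only isolated nodes left: fall back to uniform
            weights = [1] * len(rest)
        chosen.add(rng.choices(rest, weights=weights, k=1)[0])
    return chosen

def sample_rl(adj, s, rng):
    """RL: random links are kept until about s*n nodes are covered (assumption)."""
    k = target_size(adj, s)
    links = sorted({(min(u, v), max(u, v)) for u in adj for v in adj[u] if u != v})
    rng.shuffle(links)
    kept, nodes = [], set()
    for u, v in links:
        if len(nodes) >= k:
            break
        kept.append((u, v))
        nodes.update((u, v))
    return nodes, kept

def sample_bf(adj, s, rng):
    """BF: breadth-first search from a random node until s*n nodes are visited."""
    k = target_size(adj, s)
    start = rng.choice(sorted(adj))
    seen, queue = {start}, deque([start])
    while queue and len(seen) < k:
        u = queue.popleft()
        for v in sorted(adj[u]):
            if v not in seen and len(seen) < k:
                seen.add(v)
                queue.append(v)
    return seen
```

In this sketch, `sample_bf` returns fewer nodes than requested when the start node lies in a small component, and `sample_rl` may overshoot the target by one node because each link contributes up to two new nodes.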

Figure 2: Merging methods applied to a small sample network. The shape of the nodes indicates (a) the
nodes' community membership (BP) and (b) whether the nodes are at a distance less than 5 (c = 2)
within one box (CG). Communities and boxes are marked by a gray contour. The simplified network
(shown for both cases in (c)) is obtained by merging nodes inside one community or box into supernodes.

Sampling methods outperform merging ones in terms of the advantages listed above.
Still, we consider two methods from the merging class (Fig. 2). We use merging nodes
based on community detection, where supernodes are identified by communities revealed
by balanced propagation [?] (BP). We also employ cluster-growing renormalization [31]
(CG), which incrementally grows supernodes from randomly selected seed nodes within
a distance not larger than c (the nodes within one supernode are at most 2c + 1 steps
apart). Both methods performed well in analyzing the invariance of network density under
different renormalizations [?].
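
Cluster-growing renormalization, as described above, can be sketched as follows. The sketch assumes an undirected adjacency dictionary with comparable node labels, measures distances in the full original network, and lets only still-unassigned nodes join a new supernode; these details and the function name are assumptions, not the exact procedure of the cited work.

```python
import random
from collections import deque

def cluster_growing(adj, c, rng):
    """Merge nodes into supernodes grown from random seeds up to distance c."""
    unassigned = set(adj)
    box_of = {}          # node -> supernode id
    box_id = 0
    while unassigned:
        seed = rng.choice(sorted(unassigned))
        # Breadth-first search from the seed, limited to depth c.
        dist = {seed: 0}
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            if dist[u] == c:
                continue
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Only nodes that are still free join the new supernode.
        for v in dist:
            if v in unassigned:
                box_of[v] = box_id
                unassigned.discard(v)
        box_id += 1
    # Superlinks connect supernodes that share at least one original link.
    super_adj = {b: set() for b in range(box_id)}
    for u in adj:
        for v in adj[u]:
            bu, bv = box_of[u], box_of[v]
            if bu != bv:
                super_adj[bu].add(bv)
                super_adj[bv].add(bu)
    return box_of, super_adj
```

Smaller values of c yield more supernodes and hence a larger simplified network, which matches the role of c as the only size control for CG.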

We define s as the number of nodes in the simplified network, measured as the fraction 
of nodes in the original network. For sampling, we set the sizes of the simplified networks 
as varying from 1% to 50% of the original networks (s = 0.01 and s = 0.05-0.50 with a 
step size of 0.05). For BP, we set the parameters of the algorithm as suggested in [35].
With CG, we cannot control the size of the simplified network; still, we can change the 
distance between the nodes within one supernode. Therefore, the parameter c ranges 
from two to six, where smaller values indicate a smaller number of nodes within one 
supernode and thus a larger simplified network. 


2.2. Network data 

A diverse set of real-world systems is analyzed. We consider 30 networks of different 
origin (e.g., information, technological and social) and size (varying from a few thousand 
to a few hundred thousand nodes), listed in Table 1. Due to the large number of networks
considered, a detailed description is omitted here. 

For BP, CG and BF, all networks are considered to be undirected, although some 
of them are directed. To avoid comparing networks of different complexity, we remove 
self-loops and multiple links from all networks for simplification via merging methods. 
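
The preprocessing described above (treating links as undirected and removing self-loops and multiple links) can be illustrated with a short sketch. The edge-list input format and the function name are assumptions made for illustration.

```python
def to_simple_undirected(edges):
    """Build an undirected adjacency dictionary without self-loops or multi-links."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set())
        adj.setdefault(v, set())
        if u != v:            # drop self-loops
            adj[u].add(v)     # sets collapse repeated and reciprocal links
            adj[v].add(u)
    return adj
```

Nodes that only appear in self-loops are kept as isolated nodes in this sketch; whether they should be dropped instead is not specified in the text.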

2.3. Assessment approach 

To perform a fair and sound assessment, we first address the aforementioned questions 
concerning the comparison approach (Q1) and the size to which a certain network should
be simplified (Q2). To address Q1, we select a set of local and global network properties
to be observed. To address Q2, we introduce a simple measure that takes into account all 
of the selected properties and for each network calculates the simplified size that would 
best preserve the observed properties. The specific size of the simplified networks is then 
used in a further analysis to compare the selected simplification methods (Q3). 


2.3.1. Comparing original and simplified networks 

Table 1: Real-world networks (n and m correspond to the number of nodes and links, respectively).

Network                   Type                 n        m
High E. Particle Phys.    Citation         27240   342437
High E. Phys.             Citation         34546   421578
NBER US patents           Citation        240548   561060
Citeseer publications     Citation        384413  1764929
PGP web-of-trust          Collaboration    10680    24340
High E. Phys. archive     Collaboration    12008   237010
Astro Phys. archive       Collaboration    18772   396160
Cond. Matters archive     Collaboration    23133   186936
Computer science          Collaboration   317080  1049866
Digg user reply           Communication    30398    87627
Emails at Enron           Communication    36692   367662
Facebook wall post        Communication    46952   876993
Emails at EU res. inst.   Communication   265214   420045
Amazon products 1         Co-purchase     334863   925872
Amazon products 2         Co-purchase     403394  3387388
Flickr images metadata    Co-occurrence   105938  2316948
Oregon aut. systems       Internet         22963    48436
Gnutella file sharing 1   Internet         36682    88328
Gnutella file sharing 2   Internet         62586   147829
Foldoc dictionary         Information      13356   120238
Wikipedia votes           On-line social    7115   103689
Brightkite friendship     On-line social   58228   214078
Epinions trust            On-line social   75879   508837
Slashdot friendship       On-line social   82168   948464
Wikipedia interactions    On-line social  186485   740397
Gowalla friendship        On-line social  196591  1900654
Broad-topic queries       Web graph         6175    16150
google.com internal       Web graph        15763   171206
nd.edu domain             Web graph       325729  1497134
Baidu articles            Web graph       415641  3284387

We compare networks based on eight fundamental global and local properties. The
global properties are expressed by a single value for each network and include density
(the ratio of existing links to all possible links), degree mixing (the tendency of nodes
to connect to similar ones [?]) and transitivity (the number of closed triplets over the
total number of triplets [?]). The local properties are described by a distribution over
all nodes in the network and comprise degree, in-degree and out-degree (the number of
neighbors of each node), local clustering coefficient (the proportion of connected neighbors
of each node [?]) and betweenness centrality (the number of shortest paths between
all nodes going through each node [?]).
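
To make the definitions of density, transitivity and the local clustering coefficient concrete, the following sketch computes them for an undirected simple network stored as an adjacency dictionary. The representation and the function names are assumptions made for illustration; degree mixing and betweenness centrality are omitted for brevity.

```python
def density(adj):
    """Ratio of existing links to all possible links."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def _links_among_neighbors(adj, v):
    """Number of links between the neighbors of v."""
    nbrs = adj[v]
    return sum(1 for u in nbrs for w in adj[u] if w in nbrs) / 2

def local_clustering(adj):
    """Proportion of connected neighbor pairs, for every node."""
    result = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        pairs = k * (k - 1) / 2
        result[v] = _links_among_neighbors(adj, v) / pairs if k > 1 else 0.0
    return result

def transitivity(adj):
    """Closed triplets over all connected triplets."""
    closed = sum(_links_among_neighbors(adj, v) for v in adj)
    triplets = sum(len(nbrs) * (len(nbrs) - 1) / 2 for nbrs in adj.values())
    return closed / triplets if triplets else 0.0
```

For a triangle with one pendant node attached, the sketch gives a density of 2/3, a transitivity of 0.6 and local clustering coefficients of 1/3, 1, 1 and 0.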

For comparison, we define two similarity measures, one based on the selected global
properties and one on the selected local properties. The global similarity measure is
used to determine how correlated the global properties of the observed original networks
and their simplified versions are. The correlation is measured with Spearman's correlation
coefficient ρ, which indicates the extent to which one variable tends to increase (ρ > 0)
or decrease (ρ < 0) monotonically as the other increases.
In our analysis, we calculate ρ for each selected simplification method and each size of
the simplified networks for all networks together.

The comparison based on the selected local properties is expressed using the Kolmogorov-Smirnov
D-statistic (the Kolmogorov-Smirnov test checks the null hypothesis that the
distributions of two properties are the same; the D-statistic measures the distance between
the observed distributions). The D-statistic for each network and its simplified
version is calculated for each simplification method separately. The values for comparison
based on ρ and the D-statistics are averaged over ten simplifications of each network,
each simplification method and each size of the simplified networks.
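
Both similarity measures are standard statistics. As a rough sketch, they can be computed with the standard library alone as shown below; the function names are hypothetical, ties receive average ranks, and the two-sample form of the Kolmogorov-Smirnov statistic is assumed.

```python
from bisect import bisect_right

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5  # undefined if one variable is constant

def ks_statistic(a, b):
    """Two-sample D-statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in set(a) | set(b):
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d
```

Here `x` and `y` would hold one global property (for example, density) of the original and simplified networks, and `a` and `b` the per-node values of one local property in an original network and its simplified version.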

Table 2: An illustrative example of the assessment approach. (left) The results of a comparison between
simplified and original networks based on global properties. (right) The results after ranking sizes for
each property. Pi denotes properties and Si sizes of simplified networks.

(left)                       (right)
      P1    P2    P3               P1  P2  P3  Sum      A
S1   0.84  0.69  0.75        S1     5   5   5   15  1.000
S2   0.88  0.89  0.87        S2     4   4   4   12  0.800
S3   0.90  0.92  0.89        S3     3   3   2    8  0.533
S4   0.96  0.95  0.88        S4     0   1   3    4  0.267
S5   0.91  0.94  0.92        S5     2   2   0    4  0.267
S6   0.93  0.96  0.90        S6     1   0   1    2  0.133



The selection of properties and their relevance in assessing the effectiveness of network
simplification greatly depends on the purpose of the simplification being performed. The
selection of particular properties in this analysis is only supported by their common use
in similar studies (e.g., [9, 10]) and serves to demonstrate the effectiveness of the proposed
approach. Note that comparing networks based on other sets of properties may lead to
different results.

In the literature, we can find studies that have performed similar comparisons to
a limited extent. In [12], the authors showed that RN does not preserve the degree
distribution of scale-free networks. Moreover, RN and RL sampling are biased toward
nodes with high degrees, which affects the degree distribution [?]. However, Lee et al. [?]
showed that RN and RL overestimate the degree and betweenness centrality exponent,
whereas both methods retain the assortativity of original networks. Merging methods
decrease the density [?], but the relationship between density and network size remains
invariant after simplification.


2.3.2. Determining simplified network sizes 

To determine the size to which a specific network can be decreased while preserving
most of the observed properties, the following approach is used. For each simplification
method and each global and local property, we rank sizes with respect to ρ and the
D-statistic, respectively. The network size that best fits a specific property receives rank
0, the next best one receives rank 1 and so on. Next, we sum the ranks for each size
and divide the sum by the greatest possible sum of ranks to normalize the result to the
interval [0,1]. Thus, the measure A is defined as

\[
A = \frac{\sum_{i=1}^{n_p} r_i}{n_p \, (n_s - 1)}, \qquad (1)
\]

where n_s denotes the number of different sizes, n_p denotes the number of properties,
i indexes the properties (the order is not important) and r_i is the rank of the i-th
property. A is thus the normalized total rank assigned to a specific size by the observed
properties.
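
The ranking procedure can be checked against the illustrative values of Table 2. The sketch below is a minimal reading of the description above: names are hypothetical, and ties are broken by input order, which is an assumption because the text does not say how ties are handled.

```python
def rank_sizes(scores, higher_is_better=True):
    """Rank 0 goes to the best score, rank 1 to the next best and so on."""
    order = sorted(range(len(scores)), key=lambda j: scores[j],
                   reverse=higher_is_better)
    ranks = [0] * len(scores)
    for rank, j in enumerate(order):
        ranks[j] = rank
    return ranks

def measure_a(score_table, higher_is_better=True):
    """Normalized rank sum per size; score_table[j][i] is size j on property i."""
    n_s = len(score_table)       # number of sizes
    n_p = len(score_table[0])    # number of properties
    per_property = [
        rank_sizes([row[i] for row in score_table], higher_is_better)
        for i in range(n_p)
    ]
    return [
        sum(per_property[i][j] for i in range(n_p)) / (n_p * (n_s - 1))
        for j in range(n_s)
    ]

# Correlation values for sizes S1..S6 and properties P1..P3 (left part of Table 2).
table2 = [
    [0.84, 0.69, 0.75],
    [0.88, 0.89, 0.87],
    [0.90, 0.92, 0.89],
    [0.96, 0.95, 0.88],
    [0.91, 0.94, 0.92],
    [0.93, 0.96, 0.90],
]
a_values = [round(a, 3) for a in measure_a(table2)]
```

For correlation-based scores a higher value is better, whereas for the D-statistic a lower value is better, so `higher_is_better=False` would be passed in that case.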

Table 2 shows an example of the measure A calculated by comparing six different
sizes for a simplified network, taking into account the ρ value that a specific size receives
for each of the three observed global properties. In this example, the most appropriate
size for preserving global properties is S6.



2.3.3. Comparing simplification methods 

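Finally, we compare the different methods for a given size of a simplified network.
We rank the methods and measure their effectiveness using a modified version of the
measure A described in the previous subsection:

\[
A = \frac{\sum_{i=1}^{n_p} r_i}{n_p \, (n_m - 1)}, \qquad (2)
\]

where n_m is the number of different methods, n_p is the number of properties and
r_i is the rank that a method receives for the i-th property.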

With the described measure, we regard all properties as equally important. Still,
depending on the purpose of the simplified networks considered and the method by
which those networks are analyzed, one property can be more essential than another.
With respect to importance, we can assign weights w to the properties; thus, the measure
A becomes

\[
A_w = \frac{\sum_{i=1}^{n_p} w_i \, r_i}{(n_m - 1) \sum_{i=1}^{n_p} w_i}, \qquad (3)
\]

where w is the vector of weights and thus w_i denotes the weight of property i.

For simplicity, we omit the analysis performed based on the measure A_w and thus
assume all properties are equally important.
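
One possible reading of the weighted variant is sketched below. The normalization by the largest attainable weighted rank sum is an assumption chosen so that the result stays in [0,1] and reduces to the unweighted measure when all weights are equal; the function name and argument order are hypothetical.

```python
def weighted_measure(ranks, weights, n_items):
    """Weighted normalized rank sum for one method (or one size).

    ranks[i] is the rank received for property i (0 is best),
    weights[i] is the importance of property i and
    n_items is the number of ranked methods (or sizes).
    """
    total_weight = sum(weights)
    if n_items < 2 or total_weight <= 0:
        raise ValueError("need at least two ranked items and positive total weight")
    weighted_sum = sum(w * r for w, r in zip(weights, ranks))
    return weighted_sum / ((n_items - 1) * total_weight)
```

With ranks (1, 0, 1) among six items, equal weights give 2/15, whereas doubling the weight of the first property gives 3/20, so a poor rank on an important property is penalized more strongly.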

3. Analysis and discussion 

The analysis consists of two stages. First, we determine the size of the simplified networks
that ensures adequate preservation of the observed properties. Second, we compare
the effectiveness of different methods for a specific size of the simplified networks.

3.1. Effectiveness of the simplification process with respect to the size of the simplified networks

First, we analyze the effect of simplified network size on the effectiveness of the
simplification process. As expected, the results reveal that in the majority of cases, the
largest simplified networks (c = 2 for CG and s = 0.5 for sampling methods) are more
similar to the original networks and thus better fit the original networks' properties.
However, the main goal of the simplification is to sufficiently reduce large networks
to allow for easier analysis and understanding, which is achieved when the simplified
networks are smaller. Therefore, we define the best size as the local minimum of A
achieved at the smallest simplified network size (we assume that A = 1 for s = 0 and
take the global minimum if it is also local).
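
The selection rule can be read in more than one way, so the following sketch fixes one interpretation and labels it as such: sizes are scanned from the smallest to the largest, A = 1 is assumed at s = 0, and the first size whose A is strictly smaller than both neighbors is returned. The strict comparison, the treatment of the largest size as having only a left neighbor, and the fallback to the global minimum are all assumptions.

```python
def best_size(sizes, a_values):
    """Smallest size at which A has a strict local minimum (one interpretation)."""
    pairs = sorted(zip(sizes, a_values))
    padded = [1.0] + [a for _, a in pairs]   # A = 1 is assumed for s = 0
    last = len(padded) - 1
    for j in range(1, len(padded)):
        below_left = padded[j] < padded[j - 1]
        below_right = j == last or padded[j] < padded[j + 1]
        if below_left and below_right:
            return pairs[j - 1][0]
    # No strict local minimum (e.g. a flat profile): use the global minimum.
    return min(pairs, key=lambda p: (p[1], p[0]))[0]
```

For example, a profile of A = 0.5, 0.2, 0.4, 0.1 over s = 0.01, 0.05, 0.10, 0.15 yields s = 0.05 under this reading, even though the global minimum lies at s = 0.15.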

3.1.1. Analysis of the sampling methods 

The analysis of the sampling methods reveals a high level of diversity in their effectiveness
(Fig. 3 and Table 3). Fig. 3(a) shows that under simplification methods RN
and RL, local properties are best preserved for the largest size of the simplified networks
(s = 0.5). In contrast, RD and BF perform best for smaller sizes, between s = 0.01 and
s = 0.15, for the majority of the networks (i.e., the local minimum of A is around these
values for most of the networks).

Figure 3: The results for the sampling methods. (a) Portion of networks with the best size equal to s.
(b) Distance between the original and simplified networks (average A over all networks) based on the
local properties. (c) Distance between original and simplified networks based on the global properties.


Figs. 3(b) and 3(c) show the average A over all networks for the local and global
properties, respectively. For the former, all methods behave in a similar manner. In
particular, the best fit of local properties is reached for larger simplified networks; still,
RD and BF show some deviation, indicating that for several networks smaller sizes also
provide good fits. For the global properties, RN and RL show similar behavior again
because the best preservation is achieved on smaller simplified networks (s = 0.15). For
BF and RD, the local and global minima are reached for larger simplified networks.

Table 4 shows the best sizes of simplified networks for the preservation of each network
property. RN and RL perform similarly because both provide better preservation of
local properties for the largest simplified networks. On the other hand, for RD, degree is
best preserved for smaller networks, whereas for medium-sized networks, out-degree and
clustering are best preserved. For BF, distributions of degree, out-degree and in-degree
change the least for s = 0.01 and s = 0.15. However, the methods behave in a different manner
when preserving global properties. Only RN preserves density and degree mixing well
on smaller simplified networks, whereas RD, BF and RL work best for s = 0.5.

Table 3: The best sizes c or s for the preservation of the local properties with corresponding A (in parentheses).

Network                   CG          RD            BF
High E. Particle Phys.    2 (0.00)    0.20 (0.18)   0.05 (0.38)
High E. Phys.             2 (0.08)    0.25 (0.14)   0.10 (0.44)
NBER US patents           2 (0.25)    0.35 (0.20)   0.01 (0.44)
Citeseer publications     2 (0.00)    0.20 (0.08)   0.01 (0.58)
PGP web-of-trust          4 (0.33)    0.25 (0.40)   0.05 (0.77)
High E. Phys. archive     2 (0.00)    0.05 (0.73)   0.05 (0.83)
Astro Phys. archive       2 (0.00)    0.50 (0.17)   0.05 (0.50)
Cond. Matters archive     2 (0.00)    0.50 (0.13)   0.05 (0.70)
Computer science          2 (0.17)    0.50 (0.00)   0.50 (0.00)
Digg user reply           2 (0.25)    0.10 (0.20)   0.05 (0.28)
Emails at Enron           2 (0.00)    0.01 (0.57)   0.50 (0.00)
Facebook wall post        2 (0.00)    0.10 (0.18)   0.01 (0.40)
Emails at EU res. inst.   2 (0.00)    0.50 (0.00)   0.15 (0.60)
Amazon products 1         4 (0.33)    0.50 (0.00)   0.50 (0.00)
Amazon products 2         2 (0.00)    0.50 (0.00)   0.50 (0.02)
Flickr images metadata    3 (0.33)    0.01 (0.80)   0.25 (0.37)
Oregon aut. systems       2 (0.17)    0.40 (0.23)   0.01 (0.93)
Gnutella file sharing 1   5 (0.58)    0.15 (0.34)   0.15 (0.34)
Gnutella file sharing 2   5 (0.42)    0.15 (0.34)   0.10 (0.36)
Foldoc dictionary         2 (0.17)    0.50 (0.00)   0.50 (0.02)
Wikipedia votes           2 (0.00)    0.01 (0.26)   0.01 (0.76)
Brightkite friendship     2 (0.00)    0.05 (0.37)   0.05 (0.83)
Epinions trust            2 (0.00)    0.01 (0.94)   0.01 (0.70)
Slashdot friendship       2 (0.17)    0.01 (0.30)   0.01 (0.58)
Wikipedia interactions    4 (0.47)    0.01 (0.42)   0.05 (0.44)
Gowalla friendship        2 (0.00)    0.05 (0.33)   0.05 (0.03)
Broad-topic queries       4 (0.33)    0.10 (0.26)   0.15 (0.34)
google.com internal       2 (0.00)    0.15 (0.34)   0.15 (0.36)
nd.edu domain             4 (0.08)    0.01 (0.48)   0.50 (0.32)
Baidu articles            2 (0.00)    0.10 (0.22)   0.05 (0.50)

Finally, we analyze how the preservation of local properties depends on the size and
type of the original networks (Table 3). We omit the results for RN and RL because
in all cases except two, the best size is s = 0.5. In contrast, the effectiveness of RD
is partially correlated with the original network size because medium-sized networks (n =
50000-200000) are best preserved for smaller simplified network sizes (s = 0.01-0.1),
whereas large networks (n = 200000-500000) are best preserved for larger values of s.
However, as indicated by the dependence on network type, the local properties of on-line
social networks and Web graphs are best preserved for smaller sizes s = 0.01-0.15,
whereas the local properties of citation and co-purchase networks are best preserved for
s = 0.25-0.35. All differences are statistically significant (p < 0.05, one-way ANOVA),
which rejects the null hypothesis that there is no dependence between the effectiveness
of property preservation and network type. For both RD and BF, only the properties of
co-purchase and information networks are best preserved for s = 0.5. The results reveal
no statistically significant influence of network size or type on the performance of BF.

3.1.2. Analysis of the merging methods 

The analysis of CG shows that the local network properties are best preserved when
c = 2 for 22 out of 30 networks (Fig. 4(a) and Table 3). Fig. 4(b) shows the average A
over all networks based on the local and global properties. The local properties are best
fitted for larger simplified networks (c = 2), whereas for c = 3 and c = 4 the simplified networks
best fit the global properties of the original networks.

Figure 4: The results for cluster-growing simplification. (a) Portion of networks with the best size equal
to c. (b) Distance between the original and simplified networks (A for the global and average A over all
networks for the local properties) as a function of c.

Table 4: The best sizes c or s for the preservation of local network properties with corresponding A, and
ρ for the global properties.

Property       CG          RN            RD            BF            RL
Degree         2 (0.04)    0.50 (0.00)   0.15 (0.52)   0.15 (0.61)   0.50 (0.00)
In degree      -           0.50 (0.00)   0.50 (0.26)   0.01 (0.75)   0.50 (0.02)
Out degree     -           0.50 (0.00)   0.30 (0.43)   0.01 (0.61)   0.50 (0.01)
Clustering     2 (0.25)    0.50 (0.00)   0.25 (0.44)   0.50 (0.10)   0.50 (0.00)
Betweenness    2 (0.00)    0.50 (0.00)   0.50 (0.08)   0.50 (0.05)   0.50 (0.01)
Density        2 (0.89)    0.10 (0.97)   0.45 (0.95)   0.50 (0.95)   0.50 (0.91)
Degree mixing  3 (0.34)    0.35 (0.66)   0.50 (0.77)   0.50 (0.97)   0.50 (0.63)
Transitivity   4 (0.36)    0.50 (0.99)   0.50 (0.99)   0.50 (0.99)   0.50 (0.83)



Table 4 shows the results obtained for the preservation of each property. Most of the
properties are best preserved for larger simplified networks (c = 2), with the exception 
of degree mixing and transitivity, where c = 5 and c = 6, respectively. 

The best size for preserving local network properties (Table 3) does not depend on the
original network size or type (i.e., the differences in property preservation, which would 
depend on the size and type of the original networks, are not statistically significant). 
Still, if we divide the networks roughly by type, i.e., information, social and technological, 
the correlation between the type and the effectiveness becomes statistically significant 
(i.e., the null hypothesis that there are no differences in property preservation, which 
would depend on network type, is rejected, with p < 0.05, one-way ANOVA). Thus, the 
local properties of social networks are best preserved for c = 2, in contrast to the case of 
technological networks, for which c > 2. 

3.1.3. Discussion 

The findings of the first part of the study confirm the positive correlation between
the size of the simplified networks and their similarity to the original networks because
larger simplified networks are more similar to the original ones in most cases. This
has also been shown by other studies, for example, [10]. RD and BF are more effective
for smaller simplified networks, which is consistent with the findings of other authors.
In particular, Doerr and Blenn [?] obtained a solid estimate of an original network for
s = 0.2-0.3 and s = 0.1 in the case of preserving average node degree and the power-law
degree exponent, respectively. In addition, Leskovec and Faloutsos obtained a good
fit for original networks under several sampling methods for s = 0.15. Thus, our results
advance those reported in these studies and reveal distinctions in the extent of property
preservation among different types and sizes of networks, which are the most obvious for
RD.

Table 5: The best, second-best and worst methods for the preservation of local network properties with
corresponding A, and ρ for the global properties.

Property       Best            Second-best   Worst
Degree         BF (0.25)       RD (0.26)     RL (0.84)
In degree      RD/BF (0.26)    RL (0.70)     RN (0.77)
Out degree     RD (0.32)       BF (0.33)     RL (0.70)
Clustering     RD (0.30)       BF (0.35)     RL (0.81)
Betweenness    BF (0.21)       RD (0.27)     BP (0.75)
Density        RN (0.96)       BF (0.91)     BP (0.76)
Degree mixing  BF (0.92)       RN (0.62)     BP (0.21)
Transitivity   RN (0.94)       RD (0.92)     CG (0.22)



3.2. Comparison of the effectiveness of the simplification methods 

In the second part of our study, we compare the performance of different simplification 
methods. We focus on size c = 2 for CG and s = 0.1 for sampling methods for two 
reasons. First, we select s = 0.1 as the middle size among the best sizes determined in 
the first part of the study. Second, s = 0.1 is suitable for the comparison of BP and CG, 
for which the mean sizes of simplified networks are s = 0.03 and s = 0.12, respectively. 

3.2.1. Analysis 

First, we determine the best method for preserving a specific property (Table 5).
Global properties are best preserved under RN and BF, whereas merging methods provide
the worst preservation. Fig. 5 compares the best, second-best and worst methods with
respect to all global properties. For local properties, BF and RD perform the best,
particularly BF for the degree and betweenness centrality, whereas RD performs best for
the out-degree and clustering. However, RL proves to be the worst method because it
preserves the degree, out-degree and clustering to the lowest extent. Examples of local
property preservation for the analyzed networks are presented in Fig. 6.

For a complete assessment of the effectiveness of the simplification methods, we compare
the performance of the methods for each network based on the preservation of local
properties. The results are presented in Table 6. For 23 networks, the best methods are
RD and BF. The analysis reveals a dependence between network type and method effectiveness
because BP performs the best for on-line social networks and BF performs the
best for Internet and co-purchase networks. The differences among the network types are
statistically significant (p < 0.05, one-way ANOVA). For the second-best methods, the
distinctions are less evident. Still, BP proves to be effective for other types of networks
(Internet, communication networks, Web graphs). The worst method for preserving local
properties is RL (for 22 networks), followed by RN (for 8 networks). On the other
Table 6: The best, second-best and worst methods for preserving local properties of networks with 
corresponding A. 

Network                   Best           Second-best    Worst 
High E. Particle Phys.    RD (0.10)      BF (0.17)      RL (0.97) 
High E. Phys.             BF (0.07)      RD (0.20)      RL (0.96) 
NBER US patents           BF (0.07)      BP (0.13)      RN/RL (0.80) 
Citeseer publications     RD (0.07)      BF (0.20)      RL (0.93) 
PGP web-of-trust          CG (0.13)      BP (0.20)      RN (0.93) 
High E. Phys. archive     RD (0.07)      BF (0.20)      RL (0.93) 
Astro Phys. archive       RD (0.07)      BF (0.13)      RL (1.00) 
Cond. Matters archive     BF (0.00)      RD (0.20)      RL (1.00) 
Computer science          BF (0.07)      RD (0.27)      RL (1.00) 
Digg user reply           RD (0.17)      CG/BP (0.33)   RL/RN (0.60) 
Emails at Enron           BP (0.27)      RD (0.33)      RN (0.73) 
Facebook wall post        RD (0.07)      BP (0.17)      RL (1.00) 
Emails at EU res. inst.   RL (0.13)      BP (0.20)      RD (0.73) 
Amazon products 1         BF (0.00)      BP (0.27)      RL (1.00) 
Amazon products 2         BF (0.03)      CG (0.10)      RL (1.00) 
Flickr images metadata    RD/BF (0.33)   RN (0.47)      RL (0.73) 
Oregon aut. systems       RD (0.07)      BP (0.20)      RN (0.80) 
Gnutella file sharing 1   BF (0.13)      BP (0.30)      RL (0.70) 
Gnutella file sharing 2   BF (0.13)      BP (0.30)      RL (0.70) 
Foldoc dictionary         BF (0.03)      CG/BP (0.13)   RL (1.00) 
Wikipedia votes           BP (0.13)      RN (0.27)      BF (0.60) 
Brightkite friendship     RD (0.13)      BP (0.20)      RL (0.93) 
Epinions trust            BP (0.03)      RL (0.17)      BF (0.87) 
Slashdot friendship       BP (0.07)      RD (0.23)      RL (0.83) 
Wikipedia interactions    BP (0.07)      BF (0.33)      RN (0.40) 
Gowalla friendship        RD (0.07)      BP (0.27)      RL (1.00) 
Broad-topic queries       RD (0.07)      BP (0.27)      RN (0.73) 
google.com internal       RD (0.10)      BF (0.17)      RL (0.93) 
nd.edu domain             CG (0.07)      BP/BF (0.13)   RL/RN (0.80) 
Baidu articles            RD (0.00)      BP (0.27)      RL (0.90) 


hand, BF is the worst with respect to only two on-line social networks. The results also 
show a statistically significant dependence between the worst method and network size 
(i.e., we reject the null hypothesis that there is no dependence between the network size 
and the effectiveness of the simplification methods; p < 0.05, one-way ANOVA). 
For smaller networks (n < 50000), the worst method for preserving local properties is 
RL, whereas for larger ones, the worst method is RN. 
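
The counts quoted for the worst methods can be checked directly against the Worst column of Table 6. The short sketch below transcribes that column in table order and tallies it; a tie such as RN/RL is counted once for each tied method, which is an assumption about how the counts in the text were obtained. The names `WORST` and `tally` are illustrative.

```python
from collections import Counter

# Worst-method column of Table 6, one entry per network, in table order.
WORST = """RL RL RN/RL RL RN RL RL RL RL RL/RN RN RL RD RL RL RL RN RL RL RL
BF RL BF RL RN RL RN RL RL/RN RL""".split()


def tally(entries):
    """Count how often each method appears; tied entries count for every method."""
    counts = Counter()
    for entry in entries:
        counts.update(entry.split("/"))
    return counts


worst_counts = tally(WORST)
print(worst_counts.most_common())
```

Under this tie-handling assumption, the tally gives RL for 22 networks and RN for 8 networks, in agreement with the text, and BF for the two on-line social networks mentioned above.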

3.2.2. Discussion 

The results of the second part of the study reveal several distinctions in the behavior of 
the simplification methods. RD and BF are the best at preserving the local properties 
of networks, whereas RN outperforms the other methods for global properties. In contrast, 
RL and the merging methods perform the worst. These findings are consistent with 
the results of the study reported in [^, where RD had a better performance than RN 
and RL (other methods are not considered in the aforementioned study). 

In addition to comparing the methods for s = 0.1, we also compare them for larger 
simplified networks (s = 0.5). These results are not presented because they do not differ 
significantly (i.e., the same methods are the best and the worst as for s = 0.1). 

In addition, we observe how the size of the largest weakly connected component 
Figure 5: Relationship between the global properties of the original and the simplified networks for the 
best, second-best and worst method. (a) Density. (b) Degree mixing. (c) Transitivity. 


(LWCC) changes under simplification to explain the differences in the methods’ performance. 
The LWCC of the original networks consists, on average, of 59% of all nodes. 
Under all methods, the size of the LWCC of the simplified networks depends strongly on 
the simplified network size (i.e., the smallest simplified networks have the smallest 
LWCC). RN and RL behave similarly: for both methods, the size of the LWCC varies 
from 1% for s = 0.1 to 40% for s = 0.5, although RL produces the most disconnected 
components. In contrast, simplification via RD and BF produces networks with a clearly 
larger LWCC, whose size varies from 25% for s = 0.01 to 60% for s = 0.5. Networks 
simplified by RD and BF therefore feature a larger LWCC and a smaller number of 
components, which is closer to the characteristics of the original networks. This finding 
supports the predominance of RD and BF over RN and RL. 
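
The LWCC comparison can be illustrated with a minimal sketch that samples a toy graph in three ways and reports the share of nodes in the largest component of each sample. Everything here is an assumption made for illustration: the graph is stored as a symmetric adjacency dictionary (weak connectivity ignores edge direction), RD is taken to select nodes with weight equal to their degree, BF restarts from a new random seed if a component is exhausted, and the toy graph and all function names are hypothetical. The methods as defined earlier in the paper may differ in detail.

```python
import math
import random
from collections import deque


def induced_subgraph(adj, nodes):
    """Keep only the chosen nodes and the edges among them."""
    keep = set(nodes)
    return {v: adj[v] & keep for v in keep}


def sample_rn(adj, s, rng):
    """RN-style: choose round(s * n) nodes uniformly at random."""
    k = max(1, round(s * len(adj)))
    return rng.sample(sorted(adj), k)


def sample_rd(adj, s, rng):
    """RD-style: weighted choice without replacement (weight = degree)."""
    k = max(1, round(s * len(adj)))

    def key(v):
        w = len(adj[v])
        u = 1.0 - rng.random()  # lies in (0, 1], so log(u) is defined
        # Ranking by log(u) / w is equivalent to ranking by u ** (1 / w).
        return math.log(u) / w if w > 0 else -math.inf

    keyed = sorted(((key(v), v) for v in sorted(adj)), reverse=True)
    return [v for _, v in keyed[:k]]


def sample_bf(adj, s, rng):
    """BF-style: breadth-first traversal from a random seed until k nodes are taken."""
    k = max(1, round(s * len(adj)))
    seeds = sorted(adj)
    rng.shuffle(seeds)
    visited, picked = set(), []
    for seed in seeds:
        if len(picked) >= k:
            break
        if seed in visited:
            continue
        visited.add(seed)
        queue = deque([seed])
        while queue and len(picked) < k:
            v = queue.popleft()
            picked.append(v)
            for w in sorted(adj[v]):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    return picked


def lwcc_fraction(adj):
    """Share of nodes that belong to the largest connected component."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        best = max(best, size)
    return best / len(adj) if adj else 0.0


def toy_graph(n, rng):
    """Hypothetical test graph: each new node links to up to two earlier nodes."""
    adj = {0: set()}
    for v in range(1, n):
        adj[v] = set()
        for w in rng.sample(range(v), min(2, v)):
            adj[v].add(w)
            adj[w].add(v)
    return adj


rng = random.Random(0)
graph = toy_graph(2000, rng)
for name, sampler in [("RN", sample_rn), ("RD", sample_rd), ("BF", sample_bf)]:
    sub = induced_subgraph(graph, sampler(graph, 0.1, rng))
    print(name, len(sub), round(lwcc_fraction(sub), 2))
```

In this sketch a breadth-first sample taken from a connected graph is connected by construction, whereas uniformly chosen nodes are rarely adjacent in a sparse graph, so the induced subgraph tends to fragment. The numbers printed for the toy graph are not comparable with the percentages reported for the real networks.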
Figure 6: Examples of comparison of the local properties for the original and simplified networks with the 
best, second-best and worst methods. (a) Degree distribution. (b) In-degree distribution. (c) Out-degree 
distribution. (d) Cumulative distribution of clustering. (e) Cumulative distribution of betweenness 
centrality. 
4. Conclusions 


Network simplification is an adequate tool for studying large networks for several 
reasons. In addition to the obvious advantages, including faster analysis and more efficient 
visualization, simplification can significantly improve the understanding of large 
networks. For example, data regarding the systems described by networks can often be 
missing or incomplete, and thus, networks can be considered a sampled variety of the 
original systems (e.g., identifying Internet map 5^, For this reason, understanding 
how similar the original and sampled systems are is essential. 

This study addressed three aspects of real-world network simplification. First, we 
focused on a comparison of original and simplified networks. Second, we determined what 
size of simplified network most adequately fits the properties of the original networks. 
Finally, we compared the effectiveness of several simplification methods. We analyzed 
six simplification methods with respect to 30 real-world networks and compared the 
simplified and original networks based on several properties, including degree, in-degree, 
out-degree and betweenness centrality distribution, clustering coefficient, density, degree 
mixing and transitivity. 

The results show that the quality of property preservation depends on the size of 
the simplified networks. Larger simplified networks fit original networks better; nevertheless, 
properties are adequately preserved for smaller sizes close to 10% of the size of the 
original networks, especially for random node selection based on degree and breadth-first 
sampling. Thus, the decision regarding how small a simplified network should be 
depends on the size of the original network and the purpose of the simplified network. 
If we can simplify a network by 50%, we can provide for the best fit of the original 
network properties. However, if the network is large, 50% of the original size is not a 
sufficient reduction. In that case, 10% of the original network size allows for the adequate 
preservation of important properties. Furthermore, the findings of this study reveal that 
random node selection based on degree and breadth-first sampling are the best methods, 
whereas merging methods performed the worst. 

Future work will mainly focus on other characteristics that affect the effectiveness 
of the simplification process. Moreover, instead of focusing solely on similarities, we 
will analyze typical distinctions between original and simplified networks. Furthermore, 
other ways of assessing the similarity between simplified and original networks could also 
be considered, for example comparing the backbones of networks 6l|, their community 
structure or density of edges in subnetworks [g^. Based on this and future studies, 
a wide range of principles underlying the simplification of real-world networks could be 
extracted. The application of such principles should allow for the determination of the 
most suitable simplification method for specific networks, which would allow for more 
efficient simplification and a better understanding of large real-world networks. 